Mercari Price Suggestion

In this article, we analyze and develop models using Bayesian Methods and PyStan for the Mercari Price Suggestion Challenge.

Description from Kaggle page:

Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari's marketplace.

The files consist of a list of product listings. These files are tab-delimited.

| Column | Description |
| --- | --- |
| train_id or test_id | the id of the listing |
| name | the title of the listing. Note that we have cleaned the data to remove text that look like prices (e.g. \$20) to avoid leakage. These removed prices are represented as (rm) |
| item_condition_id | the condition of the items provided by the seller |
| category_name | category of the listing |
| brand_name | |
| price | the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in test.tsv since that is what you will predict. |
| shipping | 1 if shipping fee is paid by seller and 0 by buyer |
| item_description | the full description of the item. Note that we have cleaned the data to remove text that look like prices (e.g. \$20) to avoid leakage. These removed prices are represented as (rm) |

We will model price for all categories. Our estimate of the price parameter for a product can be considered a prediction. This article is inspired by A Primer on Bayesian Methods for Multilevel Modeling [3].

Preprocessing

Modeling

In this section, we perform a panel data analysis. In panel data, observations are collected over time for the same individuals, and a regression is run over these two dimensions [1].

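A common panel data regression model is $$y_{it}= \alpha + \beta x_{it}+\varepsilon _{it},$$ where $y$ is the dependent variable, $x$ is the independent variable, $\alpha$ and $\beta$ are coefficients, $i$ and $t$ are indices for individuals and time, and $\varepsilon_{it}$ is the error term.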

For more details, please see [1, 2].

Useful functions throughout the article:

Conventional Approaches

Complete Pooling

In complete pooling, we combine the information from all the categories into a single pool of data. Thus, we treat all categories the same and estimate a single price level: $$y_{i}= \alpha + \beta x_{i}+\varepsilon _{i}.$$

However, a problem with this approach is that the price level may differ across categories.

The model can be defined with the following considerations: \begin{align} \begin{cases} \alpha~\sim~N\left(0,\sigma^2\right),\\ \beta~\sim~N\left(0,\sigma^2\right), \end{cases} \end{align}

where $\sigma$ has a half-Cauchy distribution. A half-Cauchy is one of the symmetric halves of the Cauchy distribution (when unspecified, the right half is intended).

Cauchy Probability Density Function: $$\text{Cauchy}(y|\mu,\sigma) = \frac{1}{\pi \sigma} \ \frac{1}{1 + \left((y - \mu)/\sigma\right)^2}.$$
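As a quick numerical check of the density above, the following sketch evaluates the Cauchy density and the corresponding half-Cauchy density (the right half, renormalized by a factor of 2). The function names are illustrative, and the half-Cauchy is assumed to be located at $\mu = 0$.

```python
import math

def cauchy_pdf(y, mu=0.0, sigma=1.0):
    """Cauchy density with location mu and scale sigma."""
    z = (y - mu) / sigma
    return 1.0 / (math.pi * sigma * (1.0 + z * z))

def half_cauchy_pdf(y, sigma=1.0):
    """Right half of a Cauchy(0, sigma), renormalized to integrate to 1."""
    if y < 0:
        return 0.0
    return 2.0 * cauchy_pdf(y, 0.0, sigma)

def half_cauchy_cdf(y, sigma=1.0):
    """Closed-form CDF of the half-Cauchy: (2 / pi) * atan(y / sigma)."""
    return 0.0 if y < 0 else (2.0 / math.pi) * math.atan(y / sigma)

print(cauchy_pdf(0.0))             # peak of the standard Cauchy, equal to 1 / pi
print(half_cauchy_cdf(5.0, 5.0))   # probability mass below the scale parameter
```

Half of the half-Cauchy mass lies below its scale parameter, which is one reason it is often used as a weakly informative prior for scale parameters: it still leaves substantial mass on large values.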

Moreover, let

| Parameter | Description |
| --- | --- |
| x | Shipping Fee (1 if paid by the seller, 0 if paid by the buyer) |
| y | Price (Log) |
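A minimal sketch of how $x$ and $y$ might be built from the tab-delimited file, using only the standard library. The two sample rows are made up for illustration, and `log1p` (that is, $\log(1 + \text{price})$) is an assumption chosen so that a zero price stays finite; the article does not state which log transform is used.

```python
import csv
import io
import math

# Two made-up rows in the train.tsv column layout (hypothetical values).
sample_tsv = (
    "train_id\tname\titem_condition_id\tcategory_name\tbrand_name\t"
    "price\tshipping\titem_description\n"
    "0\tExample shirt\t3\tMen/Tops/T-shirts\t\t10.0\t1\tNo description yet\n"
    "1\tExample keyboard\t3\tElectronics/Computers/Components\t\t52.0\t0\tWorks well\n"
)

# The files are tab-delimited, so the delimiter is set explicitly.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# x: shipping indicator (1 if the seller pays), y: log-transformed price.
x = [int(row["shipping"]) for row in rows]
y = [math.log1p(float(row["price"])) for row in rows]
print(x, y)
```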

To implement this, we use the StanModel API from PyStan.
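A minimal sketch of how the complete pooling model could be written for PyStan 2.x, whose `StanModel` class compiles a Stan program passed as `model_code`. The prior scales (5 for the half-Cauchy, 10 for the normals), the variable names, the use of $\sigma$ as the residual scale, and the sampler settings are all assumptions for illustration, not the values used in this article. PyStan is imported inside the function so that the Stan program can be inspected without it installed.

```python
complete_pooling_code = """
data {
  int<lower=0> N;        // number of listings
  vector[N] x;           // shipping indicator
  vector[N] y;           // log price
}
parameters {
  real alpha;            // intercept
  real beta;             // shipping coefficient
  real<lower=0> sigma;   // residual scale
}
model {
  sigma ~ cauchy(0, 5);  // half-Cauchy through the lower bound (scale assumed)
  alpha ~ normal(0, 10); // weakly informative prior (scale assumed)
  beta ~ normal(0, 10);
  y ~ normal(alpha + beta * x, sigma);
}
"""

def fit_complete_pooling(x, y, iterations=1000, chains=2):
    """Compile the program and draw posterior samples (requires PyStan 2.x)."""
    import pystan  # imported lazily; only needed when the model is fitted

    data = {"N": len(x), "x": list(x), "y": list(y)}
    model = pystan.StanModel(model_code=complete_pooling_code)
    return model.sampling(data=data, iter=iterations, chains=chains)
```

Calling `fit_complete_pooling(x, y)` would return a fit object whose `extract()` method gives the posterior draws of `alpha`, `beta`, and `sigma`.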

Complete Pooling Parameters


Note that this is similar to a linear regression model (see this article for more details on linear regression).

Here, the intercept corresponds to $\alpha$ and the coefficient of Shipping Fee corresponds to $\beta$.
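To make the comparison with linear regression concrete, the sketch below computes the ordinary least squares intercept and slope in closed form. The data are made up, and the helper name `ols_fit` is illustrative. With wide priors and enough data, the posterior means of $\alpha$ and $\beta$ would be expected to land close to these values.

```python
def ols_fit(x, y):
    """Return (alpha, beta) minimizing the squared error of y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Made-up data: shipping indicator and log price.
x = [0, 0, 1, 1]
y = [3.0, 3.2, 2.6, 2.8]
alpha, beta = ols_fit(x, y)
print(alpha, beta)
```

Because $x$ is binary, the intercept is the mean log price when the buyer pays shipping, and the slope is the difference between the two group means.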

No Pooling (Unpooled Model)

We assume that there is no connection at all between the Price (Log) levels in the different categories. In other words, we model the price in each category independently. That is,

$$y_{i}= \alpha_{j[i]} + \beta x_{i}+\epsilon_i,$$

The model can be defined with the following considerations: \begin{align} &\epsilon_i \sim N(0, \sigma_y^2)\\ &\alpha_{j[i]}, \beta \sim N(\mu, \sigma^2) \end{align}
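As a non-Bayesian point of reference for the unpooled model, the sketch below computes the least squares estimates of one intercept per category with a shared slope, using the within-category (demeaning) form of the dummy-variable regression. The category labels and numbers are made up, and `unpooled_fit` is an illustrative name.

```python
from collections import defaultdict

def unpooled_fit(categories, x, y):
    """Least squares fit of y = alpha[j] + beta * x with one intercept per category."""
    groups = defaultdict(list)
    for c, xi, yi in zip(categories, x, y):
        groups[c].append((xi, yi))

    # Category means of x and y.
    means = {
        c: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for c, pts in groups.items()
    }

    # Shared slope from deviations around each category's own means.
    s_xy = sum((xi - means[c][0]) * (yi - means[c][1])
               for c, xi, yi in zip(categories, x, y))
    s_xx = sum((xi - means[c][0]) ** 2 for c, xi in zip(categories, x))
    beta = s_xy / s_xx

    # Each intercept makes the fitted line pass through its category's means.
    alpha = {c: m_y - beta * m_x for c, (m_x, m_y) in means.items()}
    return alpha, beta

# Made-up data: two categories, shipping indicator, log price.
categories = ["Men", "Men", "Women", "Women"]
x = [0, 1, 0, 1]
y = [3.0, 2.5, 4.0, 3.5]
alpha, beta = unpooled_fit(categories, x, y)
print(alpha, beta)
```

Each intercept is estimated from its own category's listings only, which is why categories with few listings end up with noisy estimates.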

A plot of the ordered estimates:

A visual comparison of the pooled and unpooled estimates for a subset of categories:

Multilevel and Hierarchical models

We discuss a few approaches here, starting with the simplest.

Partial Pooling - the Simplest

The simplest possible partial pooling model for the retail price dataset is one that simply estimates prices, $\alpha$, with no other predictors and ignoring the effect of shipping, $\beta$.

In doing so, let $$y_i = \alpha_{j[i]} + \epsilon_i$$

where \begin{align} &\epsilon_i \sim N(0, \sigma_y^2)\\ &\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2) \end{align}

Standard Error $$SE = \frac{\sigma} {\sqrt{n}}$$

The unpooled estimates are less precise than the partially pooled estimates.
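The standard error formula explains why the unpooled estimates are noisier: a category with few listings has a small $n$ and hence a large $SE$. Partial pooling compensates by shrinking each category mean toward the overall mean. The sketch below uses the usual precision-weighted approximation, $$\hat{\alpha}_j \approx \frac{(n_j/\sigma_y^2)\,\bar{y}_j + (1/\sigma_{\alpha}^2)\,\bar{y}_{\text{all}}}{n_j/\sigma_y^2 + 1/\sigma_{\alpha}^2},$$ with made-up values for the variances and means; the function names are illustrative.

```python
import math

def standard_error(sigma, n):
    """Standard error of a mean of n observations with standard deviation sigma."""
    return sigma / math.sqrt(n)

def partial_pool_mean(y_bar_j, n_j, y_bar_all, sigma_y, sigma_alpha):
    """Precision-weighted average of a category mean and the overall mean."""
    w_category = n_j / sigma_y ** 2
    w_overall = 1.0 / sigma_alpha ** 2
    return (w_category * y_bar_j + w_overall * y_bar_all) / (w_category + w_overall)

# Made-up values: two categories with the same raw mean but different sizes.
sigma_y, sigma_alpha, y_bar_all = 0.8, 0.4, 3.0
for n_j in (4, 400):
    se = standard_error(sigma_y, n_j)
    pooled = partial_pool_mean(4.0, n_j, y_bar_all, sigma_y, sigma_alpha)
    print(n_j, round(se, 3), round(pooled, 3))
```

With these numbers the small category is pulled halfway toward the overall mean, while the large category barely moves.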

Partial Pooling - Varying Intercept

This model allows intercepts to vary across categories, according to a random effect.

$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$

where \begin{align} &\epsilon_i \sim N(0, \sigma_y^2)\\ &\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2) \end{align}
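A sketch of how the varying intercept model might look as a Stan program for PyStan 2.x, together with a helper that turns category names into the 1-based index $j[i]$. The data names (`J`, `category`), the prior scales, and the helper `encode_categories` are assumptions for illustration, not taken from the article's own code.

```python
varying_intercept_code = """
data {
  int<lower=0> N;                      // number of listings
  int<lower=1> J;                      // number of categories
  int<lower=1, upper=J> category[N];   // category index j[i], 1-based
  vector[N] x;                         // shipping indicator
  vector[N] y;                         // log price
}
parameters {
  vector[J] a;                         // category intercepts alpha_j
  real b;                              // shared shipping coefficient beta
  real mu_a;                           // mean of the intercepts
  real<lower=0> sigma_a;               // spread of the intercepts
  real<lower=0> sigma_y;               // residual scale
}
model {
  sigma_a ~ cauchy(0, 5);              // half-Cauchy via the lower bound (assumed scale)
  sigma_y ~ cauchy(0, 5);
  mu_a ~ normal(0, 10);
  b ~ normal(0, 10);
  a ~ normal(mu_a, sigma_a);           // partial pooling of the intercepts
  y ~ normal(a[category] + b * x, sigma_y);
}
"""

def encode_categories(names):
    """Map category names to 1-based integer codes, as Stan indexing expects."""
    lookup = {name: i + 1 for i, name in enumerate(sorted(set(names)))}
    return [lookup[name] for name in names], lookup

def stan_data(names, x, y):
    """Assemble the data dictionary that the Stan program above declares."""
    codes, lookup = encode_categories(names)
    return {"N": len(y), "J": len(lookup), "category": codes,
            "x": list(x), "y": list(y)}
```

The dictionary returned by `stan_data` could then be passed as `data=` to the `sampling` method of a compiled `StanModel`.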

Prediction

Let's pick a category. For example,
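One way to turn posterior draws into a price prediction for a chosen category is to simulate from the likelihood for each draw and transform back from the log scale. In the sketch below the posterior draws are simulated stand-ins with made-up means and spreads, not output from the fitted model, and the back-transform assumes $y = \log(1 + \text{price})$.

```python
import math
import random
import statistics

random.seed(0)

# Stand-in posterior draws for one category (hypothetical values).
n_draws = 2000
alpha_j = [random.gauss(3.0, 0.05) for _ in range(n_draws)]       # category intercept
beta = [random.gauss(-0.3, 0.02) for _ in range(n_draws)]         # shipping coefficient
sigma_y = [abs(random.gauss(0.7, 0.01)) for _ in range(n_draws)]  # residual scale

def predict_price(shipping):
    """Posterior predictive draws of price for a listing with the given shipping flag."""
    log_price = [random.gauss(a + b * shipping, s)
                 for a, b, s in zip(alpha_j, beta, sigma_y)]
    # Assumes y = log(1 + price); clip at zero since a price cannot be negative.
    return [max(math.expm1(v), 0.0) for v in log_price]

for shipping in (0, 1):
    draws = predict_price(shipping)
    print(shipping, round(statistics.median(draws), 2))
```

Summaries such as the median or a central interval of these draws give a suggested price together with its uncertainty.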

Partial Pooling - Varying Slope model

An alternative is to keep a single intercept and let the slope vary by category:

$$y_i = \alpha + \beta_{j[i]} x_{i} + \epsilon_i$$

where \begin{align} \begin{cases} \epsilon_i \sim N(0, \sigma_y^2)\\ \alpha \sim N(0, \sigma^2)\\ \beta_{j[i]} \sim N(\mu_{\beta}, \sigma_{\beta}^2) \end{cases} \end{align}

This model highlights the relationship between the logarithm of the price, the prevailing price level, and the effect of who pays the shipping fee, which is allowed to differ by category.

Prediction

Partial Pooling - Varying Slope and Intercept

The most general model allows both the intercept and slope to vary by category:

$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_{i} + \epsilon_i$$

where \begin{align} \begin{cases} \epsilon_i \sim N(0, \sigma_y^2),\\ \alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2),\\ \beta_{j[i]} \sim N(\mu_{\beta}, \sigma_{\beta}^2). \end{cases} \end{align}
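To see what the most general model implies, the sketch below simulates data from it: each category draws its own intercept and slope from the population distributions, and each listing adds observation noise. All numerical values are made up for illustration.

```python
import random

random.seed(1)

mu_alpha, sigma_alpha = 3.0, 0.5    # population of intercepts (hypothetical)
mu_beta, sigma_beta = -0.3, 0.1     # population of slopes (hypothetical)
sigma_y = 0.7                       # residual scale (hypothetical)
n_categories, n_per_category = 5, 200

data = []
params = []
for j in range(n_categories):
    alpha_j = random.gauss(mu_alpha, sigma_alpha)
    beta_j = random.gauss(mu_beta, sigma_beta)
    params.append((alpha_j, beta_j))
    for _ in range(n_per_category):
        x = random.randint(0, 1)                          # shipping indicator
        y = random.gauss(alpha_j + beta_j * x, sigma_y)   # log price
        data.append((j, x, y))

# With a binary x, a category's slope is the gap between its two group means.
for j, (alpha_j, beta_j) in enumerate(params):
    y0 = [y for c, x, y in data if c == j and x == 0]
    y1 = [y for c, x, y in data if c == j and x == 1]
    gap = sum(y1) / len(y1) - sum(y0) / len(y0)
    print(j, round(beta_j, 2), round(gap, 2))
```

Comparing each simulated slope with the raw gap in its category shows how much noise a per-category estimate carries, which is what the hierarchical priors on $\alpha_{j[i]}$ and $\beta_{j[i]}$ are meant to temper.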

Prediction


References

  1. Kaggle Mercari Price Suggestion Challenge Dataset
  2. PyStan documentation
  3. A Primer on Bayesian Methods for Multilevel Modeling